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ABSTRACT 

Many web databases can be seen as providing partial and overlap- 
ping information about entities in the world. To answer queries 
effectively, we need to integrate the information about the individ- 
ual entities that are fragmented over multiple sources. At first blush 
this is just the inverse of traditional database normalization problem 
- rather than go from a universal relation to normalized tables, we 
want to reconstruct the universal relation given the tables (sources). 
The standard way of reconstructing the entities will involve joining 
the tables. Unfortunately, because of the autonomous and decen- 
tralized way in which the sources are populated, they often do not 
have Primary Key - Foreign Key relations. While tables may share 
attributes, naive joins over these shared attributes can result in re- 
construction of many spurious entities thus seriously compromis- 
ing precision. Our system, SmartInt is aimed at addressing the 
problem of data integration in such scenarios. Given a query, our 
system uses the Approximate Functional Dependencies (AFDs) to 
piece together a tree of relevant tables to answer it. The result tu- 
ples produced by our system are able to strike a favorable balance 
between precision and recall. 

1. INTRODUCTION 

With the advent of web, data available online is rapidly increas- 
ing, and an increasing portion of that data corresponds to large 
number of web databases populated by web users. Web databases 
can be viewed as providing partial but overlapping information 
about entities in the world. Conceptually, each entity can be seen 
as being fully described by a universal relation comprising of all its 
attributes. Individual sources can be seen as exporting parts of this 
universal relation. This picture looks very similar to the traditional 
database set-up. The database administrator (who ensures lossless 
normalization) is replaced by independent data providers, and spe- 
cialized users (who are aware of database querying language) are 
replaced by lay users. These changes have two important implica- 
tions: 

• Ad hoc Normalization by providers: Primary key-Foreign 
key (PK-FK) relationships that are crucial for reconstruct- 
ing the universal relation are often missing from the tables. 
This is in part because partial information about the entities 




Figure 1: Overlapping tables in the database 

are independently entered by data providers into different ta- 
bles, and synthetic keys (such as vehicle ids, model ids, em- 
ployee ids) may not be uniformly preserved across sources. 
(In some cases, such as public data sources about people, the 
tables may even be explicitly forced to avoid keeping such 
key information.) 

• Imprecise queries by lay users: Most users accessing these 
tables are lay users and are often not aware of all the at- 
tributes of the universal relation. Thus their queries may be 
"imprecise" 1 16] in that they may miss requesting some of 
the relevant attributes about the entities under consideration. 

Thus a core part of the source integration on the web can be cast 
as the problem of reconstructing the universal relation in the ab- 
sence of primary key-foreign key relations, and in the presence of 
lay users. Our main aim in this paper is to provide a fully auto- 
mated solution to this problem. One reason this problem has not 
received much attention in the past is that it is often buried under 
the more immediate problem of attribute name heterogeneity: In 
addition to the loss of PK-FK information, different tables tend to 
rename their columns While many reasonable schema mapping 
solutions have been developed to handle the schema heterogeneity 
problem (c.f. [I] |2| [3] pi), we are not aware of any effective so- 
lutions for the reconstruction problem. In this paper (as well as in 
our implemented system) we will simply assume that the attribute 
name change problem can be handled by adapting one of the ex- 
isting methods. This allows us to focus on the central problem of 
reconstruction of universal relation in the absence of primary key- 
foreign key relationships. 
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4n other words, web data sources can be seen as resulting from an 
ad hoc normalization followed by the attribute name change 



1.1 Motivating Scenario 

As a motivating scenario, let us consider a set of tables (with dif- 
ferent schema) populated in a Vehicle domain (Figure [TJ. The uni- 
versal schema of entity 'Vehicle' can be described as follows: Ve- 
hicle (VIN, vehicle-type, location, year, door-count, model, make, 
review, airbags, brakes, year, condition, price, color, engine, cylin- 
ders, capacity, power, dealer, dealer-address) 

Let us assume that the database has the following tables: Ta- 
ble |7] with Schema SI - populated by normal web users who sell 
and buy cars, Table^with Schema S2 - populated by crawling re- 
views of different vehicles from websites, Table\3\with Schema S3 
- populated by engine manufacturers/vendors with specific details 
about vehicle engines and Table^with Schema S4. The following 
shows the schema for these tables and the corresponding schema 
mappings among them: SI - (make, model_name, year, condition, 
color, mileage, price, location, phone), S2 - (model, year, vehicle- 
type, body-style, door-count, airbags, brakes, review, dealer), S3 - 
(engine, mdl, cylinders, capacity, power) and S4 - (dealer, dealer- 
address, car-models) 

The following attribute mappings are present among the schema: 
(SI: model_name = S2: model = S3: mdl, S2: dealer = S4: dealer) 
The italicized attribute MID (Model ID) refers to a synthetic pri- 
mary key which would have been present if the users shared un- 
derstanding about the entity which they are populating. If it is 
present, entity completion becomes trivial because you can simply 
use that attribute to join the tables. There can be a variety of reasons 
why that attribute is not available: (1) In autonomous databases, 
users populating the data are not aware of all the attributes and may 
end up missing the 'key' information. (2) Since each table is au- 
tonomously populated, though each table has a key, it might not be 
a shared attribute. (3) Because of the decentralized way the sources 
are populated, it is hard for the sources to agree on "synthetic keys" 
(that sometimes have to be generated during traditional normaliza- 
tion). (4) The primary key may be intentionally masked, since it 
describes sensitive information about the entity (e.g. social secu- 
rity number). 



Table 1: Schema 1 - Cars(5^i) 



MID 


Make 


Model_name 


Price 


Other Attrbs 


HACC96 


Honda 


Accord 


19000 




HACV08 


Honda 


Civic 


12000 




TYCRY08 


Toyota 


Camry 


14500 




TYCRA09 


Toyota 


Corolla 


14500 





Table 2: Schema 2 - Reviews (&) 



Model 


Review 


Vehicle-type 


Dealer 


Other Attrb 


Corolla 


Excellent 


Midsize 


Frank 




Accord 


Good 


FuUsize 


Frank 




Highlander 


Average 


SUV 


John 




Camry 


Excellent 


FuUsize 


Steven 




Civic 


Very Good 


Midsize 


Frank 





Table 3: Schema 3 - Engine (5*3) 



MID 


Mdl 


Engine 


Cylinders 


Other Attrb 


HACC96 


Accord 


K24A4 


6 




TYCRA08 


Corolla 


F23A1 


4 




TYCRA09 


Corolla 


155 hp 


4 




TYCRY09 


Camry 


2AZ-FE 14 


6 




HACV08 


Civic 


F23A1 


4 




HACV07 


Civic 


J27B1 


4 





Consider the following representative queries (that SmartInt 
is aimed at handling): 

Ql: SELECT make, model 

WHERE price < $15000 AND cylinders = M'. 

Q2: SELECT make, vehicle-type 

WHERE price < $15000 AND cylinders = M'. 

Q3: SELECT * 

WHERE price < $15000 AND cylinders = M'. 



Table 4: Schema 4 - Dealer Info (6^4) 



Dealer 


Address 


Other Attrb 


Frank 
Steven 
John 


1011 E Lemon St, Scottsdale, AZ 
601 Apache Blvd, Glendale, AZ 
900 10th Street, Tucson, AZ 





Table 5: Results of Query Q just from Table Ti 



Make 


Model 


Price 


Honda 
Toyota 
Toyota 


Civic 
Camry 
Corolla 


12000 
14500 
14500 



The first thing to note is that all these queries are partial in that 
they do not specify the exact tables over which the query is to be 
run. Further more, note that both the query constraints and pro- 
jected attributes can be are distributed over multiple tables. In Ql, 
the constraint on price can only be evaluated over the first table, 
while the constraint on the number of cylinders can only be evalu- 
ated on table 3. In Q2, the projected attributes are also distributed 
across different tables. Finally, Q3 is an imprecise (entity comple- 
tion) query, where the user essentially wants all the information- 
spread across different tables-on the entities that satisfy the con- 
strains. Before we introduce our approach, let us examine the limi- 
tations of two obvious approaches to answer these types of queries 
in our scenario: 

Answering from a single table: The first approach is to answer the 
query from the single table which conforms to the most number of 
constraints mentioned in the query and provides maximum number 
of attributes. In the given query since 'make', 'model' and 'price' 
map onto Table [T] we can directly query that table by ignoring the 
constraint on the 'cylinders'. The resulting tuples are shown in 
Table [5] The second tuple related to 'Camry' has 6 cylinders and 
is shown as an answer. Hence ignoring constraints would lead to 
erroneous tuples in the final result set which do not conform to the 
query constraints. 

Direct (Naive) Join: The second and a seemingly more reasonable 
approach is joining the tables using whatever shared attribute(s) 
are available. The result of doing a direct join based on the shared 
attribute( 'model') is shown in Table [6] If we look at the results, 
we can see that even though there is only one 'Civic' in Table[T] we 
have two Civics in the final results. The same happens for 'Corolla' 
as well. The absence of Primary Key - Foreign Key relationship 
between these two tables has lead to spurious results. 

1.2 The SmartInt Approach 

As we saw, the main challenge we face is handling query con- 
straints as well as projected attributes that are spread across tables, 
in the absence of primary key-foreign key dependencies. Broadly 
speaking, our approach is to start with a "base table" on which 
most of the query constraints can be evaluated. The remaining 
query constraints, i.e., those that are over attributes not present in 
the base table, are translated onto the base table - i.e., approxi- 
mated by constraints over the base table attributes. The tuples in 
the base table that conform to the constraints (both native to the ta- 
ble, and those that are translated onto it) are the base tuples. After 
this "constraint translation" phase, we enter a "tuple expansion" 
phase where the base tuples are expanded by predicting values of 
any (projected or other) attributes that are not part of the base ta- 
ble. Both the constraint translation and tuple expansion phases are 
facilitated by inter- attribute correlations called "approximate func- 
tional dependencies" , as well as accompanying value associations 
that we mine (learn) from samples of the database tables. 

As an illustration of the idea, suppose the following simple AFDs 
are mined from our tables (note that the actual mined AFDs can 
have multiple attributes on the left hand side): (1) & : {model} 
vehiclejtype, (2) S2 '■ {model} review, (3) S3 ' {model} 
cylinders. Suppose we start with Table 1 as the base table. Rule 



Table 6: Results of Query Q using direct-join (Tl \xi T3) 



Make 


Model 


Price 


Cylinder 


Engine 


Other attrbs 


Honda 


Civic 


12000 


4 


F23A1 




Honda 


Civic 


12000 


4 


J27B1 




Toyota 


Corolla 


14500 


4 


F23A1 




Toyota 


Corolla 


14500 


4 


155 hp 





Table 7: Results of Query Q using attribute dependencies 



Make 


Model 


Price 


Cylinders 


Review 


Dealer 


Address 


Honda 


Civic 


12000 


4 


Very Good 


Frank 


1011 E St 


Toyota 


Corolla 


14500 


4 


Excellent 


Frank 


1011 E St 



3 provides us a way to translate the constraint on the number of 
cylinders into a constraint on the model (which is present in the 
first table). Rules 1 & 2 provide information on vehicle type and 
review for a given model, and hence provide more information in 
response to the query. They allow us to expand partial information 
about the car model into more complete information about vehicle 
type, review and cylinders. The results using attribute dependencies 
are shown in Table |7]and conform to the constraints and are more 
informative compared to other approaches. 

As shown in Figure [2] the operation of SmartInt can thus 
be understood in terms of (i) mining AFDs and value association 
statistics from different tables and (ii) actively using them to prop- 
agate constraints and retrieve attributes from other non-joinable ta- 
bles. Figure [2] shows the SmartInt system architecture. When 
the user submits a query, the source selector first selects the most 
relevant 'tree' of tables from the available set of tables. The tree of 
tables provides information about the base table (onto which con- 
straints will be translated), and additional tables from which ad- 
ditional attributes are predicted. Source selector uses the source 
statistics mined from the tables to pick the tree of tables. The 
tuple expander module operates on the tree of tables provided by 
the source selector and then generates the final result set. Tu- 
ple expander first constructs the expanded schema using the AFDs 
learned by AFDMiner and then populates the values in the schema 
using source statistics. 

Contributions The specific contributions of SmartInt system 
can be summarized as follows: (i) We have developed a query an- 
swering mechanism that utilizes attribute dependencies to recover 
entities fragmented over tables, even in the absence of primary key- 
foreign key relations, (ii) We have developed a source selection 
method using novel relevance metrics that exploit the automati- 
cally mined AFDs to pick the most appropriate set of tables, (iii) 
We have developed techniques to efficiently mine approximate at- 
tribute dependencies. We provide comprehensive experimental re- 
sults to evaluate the effectiveness of SmartInt as a whole, as well 
as its AFD-mining sub-module. Our expeirments are done on data 
from Google Base, and show that SmartInt is able to strike 
a better balance between precision and recall than can be achieved 
by relying on single table or employing direct joins. 

Organization: The rest of the paper is organized as follows. Sec- 
tion [2] discusses related work about current approaches for query 
answering over web databases. Section |3] discusses some prelimi- 
naries. Section |4]proposes a model for source selection and query 
answering using attribute dependencies. Section |5] provides details 
about the methods for learning attribute dependencies. Section [6] 
presents a comprehensive empirical evaluation of our approach on 
data from GOOGLE Base. Section |7] provides conclusion and fu- 
ture work. A prototype of SmartInt system has been demon- 
strated at ICDE 2010 1 19|. 

2. RELATED WORK 

Data Integration: The standard approaches investigated in the 
database community for the problem recovering information split 
across multiple tables is of course data integration \^6^7j. The ap- 




Figure 2: Architecture of SmartInt System 

proaches involve defining a global (or mediator) schema that con- 
tains all attributes of relevance, and establishing mappings between 
the global schema and the source schemas. This latter step can be 
done either by defining global schema relations as views on source 
relations (called GAV approach), or defining the source relations 
as views on the global schema (called LAV approach). Once such 
mappings are provided, queries on the global schema can be refor- 
mulated as queries on the source schemas. While this methodology 
looks like a hand-in-glove solution to our problem, its impractical- 
ity lies in the fact that it requires manually or semi-automatically 
established mappings between global and source schemas. This is 
infeasible in our context where lay users may not even know the set 
of available tables, and even if they do, the absence of PK-FK rela- 
tions makes establishment of sound and complete mappings impos- 
sible. In contrast, our approach is a fully automated solution that 
does not depend on the availability of GAV/LAV mappings. 

Entity Identification/Resolution: The Entity Identification prob- 
lem in heterogeneous databases is matching object instances from 
different databases that represent same real world entities. Instance 
Level Functional Dependencies! 8 1 are used to derive missing ex- 
tended key information for joining the tuples. Virtual attributes j9) 
are found to map databases in different databases. However, both 
these approaches require the tables to have initial key information. 
Also, it involves manual mappings from the domain experts to an 
extent. As opposed to this, SMARTlNTpredicts values for attributes 
that are not present in the table using mined AFDs. 

Keyword Search on Databases: The entity completion queries 
handled in SmartInt are similar in spirit to keyword queries over 
databases. This latter has received significant attention [|T0| [TT] 
12 1 . The work on Kite system extends keyword search to multiple 
databases as well |13 1. While Kite doesn't assume that PK-FK re- 
lations are pre-declared, it nevertheless assumes that the columns 
corresponding to PK-FK relations are physically present in the dif- 
ferent tables if only under different names. In the context of our 
running example. Kite would assume that the model id column is 
present in the tables, but not explicitly declared as a PK-FK rela- 
tion. Thus Kite focuses on identifying the relevant PK-FK columns 
using key discovery techniques (c.f. (14)). Their techniques do 
not work in the scenarios we consider where the key columns are 
simply absent (as we have argued in our motivating scenario). 

Handling Incomplete Databases & Imprecise Queries: Given a 
query involving multiple attributes, SmartInt starts with a base 
table containing a subset of them, and for each of the tuples in 
the base table, aims to predict the remaining query attributes. In 
this sense it is related to systems such as QPIAD 1 15 1. However, 
unlike QPIAD which uses AFDs learned from a single table to 
complete null- valued tuples, SmartInt uses AFDs both for trans- 
lating constraints onto the base table, and for expanding tuples in 
the base table by predicting query attributes not in the base table. 



Viewed this way, the critical challenge in SmartInt is the selec- 
tion of base table, which in turn is based on the confidences of the 
mined AFDs (see Section [4] 1). The constraint translation mech- 
anism used by SmartInt also has relations to constraint relax- 
ation approaches used by the systems aimed at handling imprecise 
queries (e.g. 1 16 |). 

Learning Attribute Dependencies: Though rule mining is pop- 
ular in the database community, the problem of AFD mining is 
largely under explored. Earlier attempts were made to define AFDs 
as an approximation to FDs (| 17 1, 1 14 1) with few error tuples failing 
to satisfy the dependency. In these lines, CORDs 1 17 1 introduced 
the notion of Soft-FDs. But, the major shortcoming of their ap- 
proach is, they are restricted to rules of the type C1^C2, where 
CI and C2 are only singleton sets of attributes. TANE 1 14| pro- 
vides an efficient algorithm to mine FDs and also talks about a 
variant of the FD-mining algorithm to learn approximate dependen- 
cies. But, their approach is restricted to minimal pass rules (Once a 
dependency of type (X^ Y) is learnt, the search process stops with- 
out generating the dependencies of the type (Z^Y), where XcZ. 
Moreover, these techniques are restricted to a single table, but we 
are interested in learning AFDs from multiple tables and AFDs in- 
volving shared attributes. In this paper, we provide a learning tech- 
nique that treats AFDs as a condensed representation of association 
rules (and not just approximations to FDs), define appropriate met- 
rics, and develop efficient algorithms to learn all the intra and inter- 
table dependencies. This unified learning approach has an added 
advantage of computing all the interesting association rules as well 
as the AFDs in a single run. 

3. PRELIMINARIES 

Our system assumes that the user does not have knowledge about 
different tables in the database and has limited knowledge about 
attributes he is interested in querying. This is a reasonable assump- 
tion, since most web databases do not expose the tables to the users. 
So we model the query in the following form where the user just 
needs to specify the attributes and constraints: Q =< A,C > 
where A are the projected attributes which are of interest to the user 
and C are the set of constraints (i.e. attribute-label, value pairs) 

Attribute dependencies are represented in the form of approx- 
imate functional dependencies. A functional dependency (FD) 
is a constraint between two sets of attributes in a relation from a 
database. Given a relation R, a set of attributes X in R is said 
to functionally determine another attribute Y, also in R, (written 
X ^ Y) if and only if each X value is associated with precisely 
one Y value. Since the real world data is often noisy and incom- 
plete, we use approximate dependencies to represent the attribute 
dependencies. An Approximate Functional Dependency (AFD) 
is an approximate determination of the form X A over relation 
R, which implies that attribute set X, known as the determining 
set, approximately determines A, called the determined attribute. 
An AFD is a functional dependency that holds on all but a small 
fraction of tuples. For example, an AFD model -w body_style 
indicates that the value of a car model usually (but not always) de- 
termines the value of body_style. 

Graph of Tables The inter-connections between different tables 
in the database are modeled as a graph. Each attribute match is 
represented as an undirected edge and any PK-FK relationship is 
represented as a directed edge pointing towards the table containing 
the primary key. 

4. QUERY ANSWERING 

In this section, we describe our query answering approach. We 
assume that attribute dependencies are provided upfront for the sys- 
tem. We outline our approach in terms of solutions to challenges 



identified earlier in Section [T] 

1) Information distributed across tables needs to be integrated: 

The information needs to be integrated since both answering queries 
with attributes spanning over multiple tables and providing addi- 
tional information to the user needs horizontal integration of the tu- 
ples across tables. In the absence of PK-FK relationships, perform- 
ing meaningful joins to integrate data is not feasible (as illustrated 
in Section [TJ. Instead we start with a 'base set of tuples' (from 
a designated base table chosen by the source selector) and suc- 
cessively expand those tuples horizontally by appending attribute 
values predicted by the attribute dependencies. This expansion is 
done recursively until the system cannot chain further or it recon- 
structs the universal relation. We use attribute determinations along 
with attribute mappings to identify attributes available in other ta- 
bles, whose values can be predicted using values of the selected 
attributes. 

2) Constraints need to be translated: The base table provides a 
set of tuples, i.e. tuples which conform to the query constraints. 
Generation of 'base set of tuples' requires taking into account con- 
straints on non-base tables. We use attribute mappings and attribute 
determinations for translating constraints onto the base table. Ba- 
sically, we need to translate the constraint on a non-base table at- 
tribute to a base table attribute through attribute determinations. In 
the example discussed in Section [T] suppose Ti is designated as a 
base table and T3 is a non-base table which has an AFD (model -w 
vehicle-type). If the query constrains the attribute vehicle-type to 
be 'SUV, then this constraint can be evaluated over the base table, 
if information about the likelihood of a model being an 'SUV is 
given. Attribute determinations provide that information. 

Now we explain how these solution approaches are embedded 
into SmartInt framework. Query answering mechanism involves 
two main stages: Source Selection and Tuple Expansion. We ex- 
plain these in detail in the next few sections. 

4.1 Source Selection 

In a realistic setting, data is expected to be scattered across a 
large number of tables, and not all the tables would be equally rel- 
evant to the query. Hence, we require a source selection strategy 
aimed at selecting the top few tables most relevant to the query. 
Given our model of query answering, where we start with a set of 
tuples from the base table which are then successively expanded, 
it makes intuitive sense for tuple expansion to operate over a tree 
of tables. Therefore source selection aims at returning the most 
relevant tree of tables over which the Tuple Expander operates. 

Given a user query, Q =< A,C > and a parameter 'k' (the 
number of relevant tables to be retrieved and examined for tuple 
expansion process), we define source selection as selecting a tree 
of tables of maximum size k which has the highest relevance to 
the query. The source selection mechanism involves the follow- 
ing steps: 1. Generate a set of candidate tables Tc = {T G 
T\relevance{T) > threshold}. This acts as a pruning stage, 
where tables with low relevance are removed from further consid- 
eration. 

2. Not all tables have a shared attribute. We need to pick a con- 
nected sub-graph of tables, Gc, with highest relevance. 

3. Select the tree with the highest relevance, among all the trees 
possible in Gc- This step involves generating and comparing the 
trees in G^ which can be computationally expensive if Gc is large. 
We heuristically estimate the best tree with the highest relevance 
to the query among all the trees. The relevance metrics used are 
explained below. 

We will explain how source selection works in the context of the 
example described in introduction. In order to answer the query 
Q, SELECT make, model WHERE price < $15000 AND 
cylinders = M',we can observe that the proj ected attributes 



make, model and constraint price < $15000 are present in Table 
[l]and constraint cylinders = '4^ is present in Tablejs] Given this 
simple scenario, we can select either Table [T] or Table]?] as the base 
table. If we select Table[3]as the base table, we should translate the 
constraint przce < $15000 from Table[l]to Table[3] using the AFD, 
model -w price. On the other hand if we designate Table[T]as base 
table, we would need to translate the constraint cylinders = '4^ 
from Table [3] to Table [T] using the AFD, model ^ cylinders. 
Intuitively we can observe that the AFD model -w cylinders gen- 
eralizes well for a larger number of tuples than model ^ price. 
Source selection tries to select the table which emanates high qual- 
ity AFDs as the base table and hence yield more precise results. 

Here we discuss the different relevance functions employed by 
the source selection stage: 

Relevance of a table: The relevance of a table T depends on two 
factors: (i) the fraction of query-relevant attributes present in the 
table and-we can view this as "horizontal relevance" and (ii) the 
fraction of tuples in the table that are expected to conform to the 
query -we can view this as "vertical relevance". We evaluate rele- 
vance as follows: 



Algorithm 1 Source Selection 



relevance{T, q) 



\A\ 



* Prr{C) * tupleCountq 



where the first factor is measuring the horizontal relevance and the 
other two estimate the vertical relevance. Specifically, Prr{C) is 
the probability that a random tuple from T conforms to constraints 
C, tupleCountr is the number of tuples in T, and Ar is the set 
of attributes in 

Relevance of a tree: While selecting the tree of relevant tables, the 
source selection stage needs to estimate the relevance of tree. The 
relevance of tree takes into account the confidence of AFDs ema- 
nating out of the table. Relevance of a tree Tr rooted at table T 
w.r.t query Q < A^C > can be expressed as: relevance{Tr, q) — 
relevance{T, q) + pred_accuracy{a) where Ab are 

the set of attributes present in the base table, and pred_accuracy{a) 
gives the accuracy with which the attribute a can be predicted. 
When the attribute is in the neighboring table it is equal to the con- 
fidence of AFD and when its not in the immediate neighbor its 
calculated the same way as in AFD chaining (Explained in Section 

The above relevance functions rely on the conformance proba- 
bility Pr{C) = U^Pr{Ci). Pr{Ci) denotes the probability that 
a random tuple from T conforms to the constraint d (of the form 
X — v), and is estimated as: 



• Pr{C^) = Pr[X : 
attributes in T 



v),ifX G Ar, where Ar is the set of 



• Pr(a) = E^ Pr{Y = v^) * Pr'{X = v\Z = v^), if T : 
y = : Z, i.e. T's neighboring table T' provides attribute 
X. (These probabilities are learnt as source statistics.) 

• Pr{Ci) = e (small non-zero probability), otherwise 

In this section we explained the source selection mechanism. 
We discuss how the tuple expansion mechanism answers the query 
from the selected sources in the next section. 

4.2 Tuple Expansion 

Source selection module gives a tree of tables which is most rel- 
evant to the query. Tuple expansion operates on the tree of tables 
given by that module. One of the key contributions of our work 
is returning the result tuples with schema as close to the univer- 
sal relation as possible. We need to first construct the schema for 

^Presently we give equal weight to all the attributes in the system, 
this can be generalized to account for attributes with different levels 
of importance. 



Require: Query q. Threshold r. Number of tables k. Set of AFDs 
A 

1: Tc = m 

2: for all table T in T do 

3: if relevance(T, q) > r then 

4: add r to Tc 

5: Gc '.= Set of connected graphs over Tc up to size k 

6: Trees = {^} 

7: for all g e Gc do 

8: Trees g = Set of trees from graph g 

9: add Treesg to Trees 

10: trecsei = arguidiXtreeeTrees relevance(tree, q) 
11: return treCsei 

Projected Attributes Constrained Attribute 
Base Table-*'- TableX 




Constrained* Attribute 



Figure 3: Expanded attribute tree for the query 



the final result set and then populate tuples that correspond to that 
particular schema from other tables. These steps are described in 
detail in the sections that follow. 

4.2.1 Constructing the Schema 

One important aspect of tuple expansion is that it is a hierar- 
chical expansion. The schema grows in the form of a tree be- 
cause attributes retrieved from other tables are relevant only to the 
determining attribute(s) (refer to the definition of AFD in Section 
[3]). This module returns a hierarchical list of attributes, AttrbTree, 
rather than a flat list. This is more clearly illustrated by the attribute 
tree generated for query discussed in Section[T] shown in Figure [3] 
The base table (Ti) contains attributes Make, Model, Price. Tables 
Ti, T2 and T3 share the attribute Model. In table T2, we have the 
AFDs Model ^ Cylinders and Model ^ Engine. These two deter- 
mined attributes are added to the base answer set, but these are only 
relevant to the attribute 'model', so they form a branch under the 
attribute 'model'. Similarly, review, dealer and vehicle type form 
another branch under 'Model'. In the next level, T3 and T4 share 
'dealer-name' attribute. 'Dealer-Name' is a key in T4, therefore 
all the attributes in T4 ('dealer-address', 'phone-number' etc) are 
attached to the AttrbTree. The final attribute tree is shown in the 
Figure [3] 

4.2.2 Populating the Tuples 

The root of the selected tree of tables given by the source selec- 
tion is designated as the base table. Once the attribute hierarchy 
is constructed, the system generates a 'base set' of tuples from the 
base table which form the 'seed' answers. We refer to this base 
set as the most likely tuples in the base table which conform to the 
constraints mentioned in the query. We call them 'most likely' tu- 
ples because when constraints are specified on one of the children 
of the base table, we propagate constraints from child to base table. 
But since we have approximate dependencies between attributes. 



the translated constraints do not always hold on the base set. To 
clearly illustrate this, let us revisit the example of Vehicle domain 
from Section[T] We assume that Table[T]has been designated as the 
base table. The constraint price < $15000 is local for the base ta- 
ble and hence each tuple can be readily evaluated for conformance. 
The constraint cylinders = '4\ on the other hand, is over Table [3] 
and needs to be translated on to the base table. Notice that these 
two tables share the attribute 'model' and this attribute can approx- 
imately determine cylinder in Table |3]( model -w Cylinders ). 
{model ^ Cylinders) implies that the likelihood of a model hav- 
ing certain number of cylinders can be estimated, which can be used 
to estimate the probability that a tuple in Table 1 would conform to 
the constraint Cylinders = '4^ We can see that model 'Civic' is 
more likely to be in the base set than 'Accord' or 'Camry'. 

Once the base tuple set has been generated, each of those tuples 
are expanded horizontally by predicting the values for the attributes 
pulled from children tables. Given a tuple from the base set, all 
the children tables (to the base table) are looked up for determined 
attributes, and the most likely value is used to expand the tuple. 
Further, values picked from the children tables are used to pick 
determined attributes from their children tables and so on. In this 
way, the base tuple set provided by the root table is expanded using 
the learned value dependencies from child tables. 

In tuple expansion, if the number of shared attributes between 
tables is greater than one, getting the associated values from other 
tables would be an interesting challenge. For instance, in our run- 
ning example. Table [T] also had the year attribute and Table [2] is 
selected as the base table. We need to predict the value of price 
from Table [T] If we consider both Model and Year to predict the 
price, results would be more accurate, but we do not have the val- 
ues of all combinations of Model and Year in Table [T]to predict the 
price. However, if we just use Model to predict the price, the pre- 
cision might go down. Another interesting scenario where taking 
multiple attributes might not boost the prediction accuracy is the 
following: Model, Numberjires -w Price is no better than Model 
^ Price. In order to counter this problem, we propose a /a// back 
approach of the AFDs to ensure high precision and recall. 

This method can be formally described as this: If X is the set 
of shared attributes between two tables Ti and T2, where Ti is the 
base table and T2 is the child table. We need to predict the values 
of attribute Y from T2 and populate the result attribute tree. If the 
size of X is equal to n (n > 1), we would first start with AFDs 
having n attributes in determining set and 'significantly higher' 
confidence than any of their AFDs. We need 'significantly higher' 
confidence because if the additional attributes do not boost the con- 
fidence much, they will not increase the accuracy of prediction as 
well. If the AFDs do not find matching values between two tables 
to predict values, we 'fall back' to the AFDs with smaller deter- 
mining set. We do this until we would be able to predict the value 
from the other table. Algorithm[2]describes it. 

5. LEARNING ATTRIBUTE DEPENDENCIES 

We have seen in the previous section how attribute dependencies 
within and across tables help us in query answering by discovering 
related attributes from other tables. But it is highly unlikely that 
these dependencies will be provided up front by autonomous web 
sources. In fact, in most cases the dependencies are not apparent 
or easily identifiable. We need an automated learning approach to 
mine these dependencies. 

As we have seen the Section|4] we extensively use both attribute- 
level dependencies (AFDs) and value level dependencies. The value 
level dependencies are nothing but association rules. The notion of 
mining AFDs as condensed representations of association rules is 
discussed in detail in 1 18 1. Our work adapts the same notion, since 
it helps us in learning dependencies both at attribute and value level. 

The following sections describe how rules are mined within the 



Algorithm 2 Tuple Expansion 

Require: Source-table-tree Su Result-attribute-tree At, Set of 
AFDs A 

1: R '.= {0} {Initializing the result set with schema At } 

2: b'.= Root (St) {Setting the base table} 

3: Translate the constraints onto base table 

4: Populate all the attributes in level of At from b 

5: for all child c in At do 

6: if 6 and c share n attributes then 

7: fd = AFDs with n attrbs in detSet 

8: while n > do 

9: if c has the specified combination then 

10: Populate R using predicted values using fd from c 

11: break 

12: fd = Pick AFDs with n-1 attributes in detSet 

13: return Result Set 



table and how they are propagated across tables. 

5.1 Intra- table learning 

In this subsection we describe the process of learning AFDs from 
a single table. It is easy to see that the number of possible AFDs 
in a database table is exponential to the number of attributes in it, 
thus AFD mining is in general expensive. But, only few of these 
AFDs are useful to us. To capture this, we define two metrics con- 
fidence and specificity for an AFD, and focus on AFDs that have 
high confidence and low specificity values. 

5.7.7 Confidence 

If an Association rule is of the form (a ^ it means that if we 
find all of a in a row, then we have a good chance of finding /3. The 
probability of finding (3 for us to accept this rule is called the con- 
fidence of the rule. Confidence denotes the conditional probability 
of head given the body of the rule. 

Generalizing to AFDs, the confidence of an AFD should simi- 
larly denote the chance of finding the value for the dependent at- 
tribute, given the values of the attributes in the determining set. 
We define confidence in terms of the confidences of the underly- 
ing association rules. Specifically, we define it in terms of picking 
the best association rule for every distinct value-combination of the 
body of the association rules. For example, if there are two associ- 
ation rules (Honda ^ Accord) and (Honda ^ Civic), given Honda, 
the probability of occurrence of Accord is greater than the proba- 
bility of occurrence of Civic. Thus, (Honda ^ Accord) is the best 
association rule, for (Make = Honda) as the body. 

N 

confidence (X-^ A) — V^arg max (support (ax) x 

Confidence{ax ^ Py)) 

Here, N denotes the number of distinct values for the determining 
set X in the relation. This can also be written as, 

n' 

Confidence (X-^ A) — y arg max {support {ax) ^ l^y) 

Example: For the database relation displayed in table |8] Confi- 
dence of the AFD {Make -w Model) = Support {Make : Honda -w 
Model : Accord) + Support {Make : Toyota ^ Model : 
Camry) = | | = |. 

5.7.2 Specificity-based Pruning 

The distribution of values for the determining set is an impor- 
tant measure to judge the "usefulness" of an AFD. For an AFD 
X -w A, the fewer distinct values of X and the more tuples in 



Table 8: Fragment of a Car Database 



ID 


Make 


Model 


Year 


Body Style 


1 


Honda 


Accord 


2001 


Sedan 


2 


Honda 


Accord 


2002 


Sedan 


3 


Honda 


Accord 


2005 


Coupe 


4 


Honda 


Civic 


2003 


Coupe 


5 


Honda 


Civic 


1999 


Sedan 


6 


Toyota 


Sequoia 


2007 


SUV 


7 


Toyota 


Camry 


2001 


Sedan 


8 


Toyota 


Camry 


2002 


Sedan 



the database that have the same value, potentially the more rel- 
evant possible answers can be retrieved through each query, and 
thus a better recall. To quantify this, we first define the support of 
a value m of an attribute set X, support (ai), as the occurrence 
frequency of value a^ in the training set. The support is defined as 
support{ai) = count {ai)/N, where N is the number of tuples 
in the training set. 

Now we measure how the values of an attribute set X are dis- 
tributed using specificity, specificity is defined as the information 
entropy of the set of all possible values of attribute set X: {ai, a2, 
. . . , am }, normalized by the maximal possible entropy (which is 
achieved when X is a key). Thus, specificity is a value that lies 
between and 1 . 

— support (ai) X \og2 {support (ai)) 



specificity (X) 



When there is only one possible value of X, then this value has 
the maximum support and is the least specific, thus we have speci- 
ficity equals to 0. When all values of X are distinct, each value has 
the minimum support and is most specific. In fact, X is a key in 
this case and has specificity equal to 1 . 

Now we overload the concept of specificity on AFDs. The speci- 
ficity of an AFD is defined as the specificity of its determining set. 
i.e. specificity (X A) = specificity (X). The lower specificity 
of an AFD, potentially the more relevant possible answers can be 
retrieved using the rewritten queries generated by this AFD, and 
thus a higher recall for a given number of rewritten queries. 

Intuitively, specificity increases when the number of distinct val- 
ues for a set of attributes increases. Consider two attribute sets X 
and Y such that YdX. Since Y has more attributes than X, the 
number of distinct values of Y is no less than that of X, specificity 
(Y) is no less than specificity (X). 

Monotonicity of specificity For any two attribute sets X and Y 
such that YdX, specificity (Y) > specificity (X). Thus, adding 
more attributes to the attribute set X can only increase the speci- 
ficity of X. Hence, specificity is monotonically increasing w.r.t 
increase in the number of attributes. 

This property is exploited in pruning the AFDs during the min- 
ing, by eliminating the search space of rules with specificity less 
than the given threshold. 

Algorithms for mining AFDs face two costs: the combinatorial 
cost of searching the rule space and the cost of scanning the data to 
calculate the required metrics for the rules. In query processing the 
AFDs which we are mostly interested are the ones with the shared 
attributes in determining set of the rule. If X -w A is an AFD, we 
are interested in rules where X ^ S, where S is the set of shared 
attributes between two tables. Since number of such attributes is 
typically small, we can use this as one of the heuristics to prune 
away irrelevant rules. 

5.1.3 AFDMiner algorithm 

The problem of mining AFDs can be formally defined as follows: 
Given a database relation r, and user-specified thresholds minConf 



(minimum confidence) and max specificity (maximum specificity 
), generate all the Approximate Functional Dependencies (AFDs) 
of the form (X -w A) from r for which confidence (X-w A) > 
minConf and specificity (X) < max specificity 

To find all dependencies according to the definition above, we 
search through the space of non-trivial dependencies and test the 
validity of each dependency. We follow a breadth first search strat- 
egy and perform a level-wise search in the lattice of attributes, 
for all the required AFDs. Bottom-up search in the lattice starts 
with singleton sets and proceeds upwards level- wise in the lattice, 
searching bigger sets. For AFDs, the level-wise bottom-up algo- 
rithm has a powerful mechanism for pruning the search space, es- 
pecially the pruning based on specificity . 

Search starts from singleton sets of attributes and works its way 
to larger attribute sets through the set containment lattice level by 
level. When the algorithm is processing a set X, it tests AFDs of 
the form X\A^ A, where A G X. 

Algorithm 3 AFDMiner: Levelwise search of dependencies 

1: Lo:=m 

2: Li := {{A} \ A e R} 

3: i:=l 

4: while / do 

5 : ComputeDependencies At ALeveKL^ ) 
6: PRUNE(L^) 

7: L^+i := GenerateNextLeveKL^) 
8: i:=i+\ 

Algorithm [3] briefly presents the main AFDMiner algorithm. 
In it, GenerateNextLevel computes the level L^+i from Li. The 
level L^+i will contain only those attribute sets of size which 
have their subsets of size iinLi. (ComputeDependenciesAtALeveKL^)) 
computes all the AFDs that hold true at the given level of the lat- 
tice. In this process, it computes the confidence of eah association 
rule constituting the AFDs. PRUNE(L£) implements the pruning 
strategies and prunes the search space of AFDs. It computes the 
specificity of each rule, and if it is less than the specified threshold, 
eliminates all the rules whose determining sets are supersets of it. 

5.2 Learning source statistics 

Storing association rules: The probabilities which we used ex- 
tensively in the query answering phase are nothing but the con- 
fidence of the association rules. So we store all the association 
rules mined during the process of AFD mining (specifically, in 
ComputeDependenciesAtALeveKL^))) and use them at query time. 
This saves us the additional cost of having to compute the associa- 
tion rules separately by traversing the whole lattice again. 

Here we describe the value level source statistics gathered by 
the system, which are employed by the query answering module 
for constraint propagation and attribute value prediction. As men- 
tioned earlier, AFD mining involves mining the underlying associ- 
ation rules. During association rule mining, following statistics are 
gathered from each source table T: (i) Pr{X = Xi): Prior prob- 
abilities of distinct values for each attribute X in Ar (ii) Pr{X = 
Xi\Y = yj): Conditional probabilities for distinct values of each 
attribute X conditioned on those of attribute Y in Ar- Recall that 
this is nothing but the confidence of an association rule. Only the 
shared attributes are used as evidence variables, since value pre- 
diction and constraint propagation can only be performed across 
shared attributes. 

5.3 Inter-table Chaining 

After learning the AFDs within a table, we need to use them to 
derive inclusion dependencies which are used in query answering 
phase. In order to combine AFDs from different tables, we need an- 
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CI 



Table 3(T3) 



AFDs : 

A0~>A1; A0~>A2 
B1~>B3; B2~>B3 



Schema Mappings: 
T1 :A1 = T2:B1 
T1 :A2 = T2:B2 
T2:B3 = T3:C1 



Figure 4: Inter- table Learning 
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Figure 5: Precision vs. Number of Constraints 

chor points. These anchor points are provided by the attribute map- 
pings across tables, so we extend our attribute dependencies using 
them. When two AFDs between neighboring tables are combined, 
the resultant AFD would have a confidence equal to the product of 
the two confidences. 

But when we are combining dependencies between tables which 
are not directly connected, we need to consider all the possibilities. 
Let us consider the scenario in Figure |4] with three tables Ti , T2 
and T3. Ti and T2 have mappings between attributes Ai — Bi and 
A2 — B2. Similarly T2 and T3 have mapping between B3 — Ci. If 
we want to get the most likely value of Ci for Aq, we have more 
than one chaining to consider. We need to consider the confidences 
of AFDs, Ao Ai, Ao -w A2 as well as the confidences of 
AFDs, Bi ^ B3, B2 ^ B3. We cannot greedily pick the AFD 
with higher confidence in either Ti or T2 . We need to pick a com- 
bination of the AFDs which have higher cumulative confidence. 

6. EXPERIMENTAL EVALUATION 

A prototype of SmartInt system, as described in this paper, 
has been implemented. The prototype supports automatic mining 
of approximate functional dependencies and value associations in 
an off-line phase. It also ranks the answer tuples it returns in terms 
of the overall confidence associated with each tuple. The prototype 
system has been demonstrated at ICDE 2010 1 19 1. 

Our intent is to evaluate the effectiveness of SmartInt in terms 
of precision and recall measures. The following explains how pre- 
cision and recall measures are computed to take into account the 
fact that SmartInt 's answers can differ from ground truth(provided 
by the master table) both in terms of how many answers it returns 
and how correct and complete each answer is. 

Correctness of a tuple (art) is defined as the fraction of its at- 
tribute values that are correct. Precision of the result set (Prs) is 
defined as the average correctness of the tuples in the result set. 
Similarly, completeness of a tuple(cpt) is defined as the fraction of 
the attributes of the real tuple that were returned. Recall of the re- 
sult set (Rrs) is defined as the average completeness of the tuples 
in the result set. 



Recalivs Constraints 




-DirectJoin 

-Single 

-Snriartlnt 



Figure 6: Recall vs Number of Constraints 



Precision vs Attributes 



— DirectJoin 

-Single 

-SmartInt 



Figure 7: Precision vs Number of Attributes 

Vehicles database. We used around 350,000 records probed from 
Google Base for the experiments. We created a master table with 
18 attributes. We divided this master table into multiple child tables 
with overlapping attributes. This helps us in evaluating the returned 
'result set' with respect to the results from master table and estab- 
lish how our approach compares with the ground truth.We have 
divided the master table into 5 different tables with the following 
schema 



• Vehicles _Japanese: (condition, pricejype, engine, model, VIN, ve- 
hicle _type, payment,door_count, mileage, price, color , body_style, 
make) 

• Vehicles _Chevrolet: (condition, year, price, model, VIN, payment, 
mileage, price, color, make), 

• Vehicles _Chevrolet_Extra: (Model, Door Count, Type, Engine) 

• Vehicles _Rest: (condition, year, price , model, VIN, payment, mileage, 
price, color, make) 

• Vehicles _Re St _Extra: (Engine, Model, Vehicle Type, door count, body 
style) 
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Experimental Setup: To evaluate the SmartInt system, we used 



Figure 8: Recall vs Number of Attributes 
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Figure 9: F-measure vs Number of Attributes 
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Figure 10: F-measure vs Number of Constraints 

The following (implicit) attribute overlaps were present among 
the fragmented tables. 

• Vehicles _Chevrolet: Model ^ Vehicle s_Rest: Model 

• Vehicles _Chevrolet:Year Ve hie les_Rest: Year 

• Vehicles _Rest:Year ^ Vehicles _Rest_Extra:Year 

• Vehicles_Chevrolet_Extra:Model ^ Vehicles_Rest_Extra:Model. 

The following are the input parameters which are changed: (1) 
Number of Attributes and (2) Number of Constraints. We mea- 
sured the value of precision and recall by taking the average of the 
values for different queries. While measuring the value for a partic- 
ular value of a parameter we varied the other parameter. While we 
are measuring precision for 'Number of attributes = 2', we posed 
queries to the system with 'Number of constraints = 2, 3 and 4' and 
took the average of all these values and plotted them. Similarly, we 
varied the 'Number of attributes' while we are measuring the Pre- 
cision for each value of 'Number of constraints'. The same process 
is repeated for measuring the recall as well. 

Comparison with 'Single table' and 'Direct join' approaches : 

In this section, we compare the accuracy of SmartInt with 
'Single table' and 'Direct join' approach which we discussed in 
[T] and analyze them. Recall that in the single table approach, re- 
sults are retrieved from a single table which has maximum num- 
ber of attributes/constraints mentioned in the query mapped on it. 
The direct join approach involves joining the tables based on the 
shared attributes. As explained in the introduction, the latter ap- 
proach tends to generate spurious entities, while the former also 
fails to draw together the connected information about the entity. 

In the simple case of queries mapping on to a single table, the 
precision and recall values are independent of attribute dependen- 
cies, since query answering does not involve constraint propagation 
or tuple expansion through attribute value prediction. 

In cases where queries span multiple tables, some of the attribute 
values have to be predicted and constraints have to be propagated 
across tables. Availability of attribute dependency information al- 
lows accurate prediction of attributes values and hence boosts pre- 
cision. As shown in Figures [5] and [7] our approach scored over the 
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Figure 11: SmartInt vs Multiple join paths 
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Figure 12: Precision, Recall and F-measure vs tuple width 

other two in precision. Direct join approach, in absence of primary- 
foreign key relationships, ends up generating non-existent tuples 
through replication, which severely compromises the precision. In 
cases where query constraints span over multiple tables, single ta- 
ble approach ends up dropping all the constraints except the ones 
mapped on to the selected best table. This again results in low pre- 
cision. 

In terms of recall (Figures [6] and |8]), performance is dominated 
by the direct join approach, which is not surprising. Since direct 
join combines partial answers from selected tables, the resulting 
tuple set contains most of the real answers, subject to complete- 
ness of individual tables. Single table approach, despite dropping 
constraints, performs poorly on recall. The selected table does not 
cover all the query attributes, and hence answer tuples are low on 
completeness, which affects recall. 

When accurate attribute dependencies are available, our approach 
processes the distributed query constraints effectively and hence 
keeps the precision fairly high. At the same time, it performs chain- 
ing across tables to improve the recall. Figures [9] and show that 
our approach scores higher on F-measure, hence suggesting that it 
achieves a better balance between precision and recall. 

Comparison with multiple join paths: 

In the previous evaluation the data model had one shared at- 
tribute between the tables, but there can be multiple shared at- 
tributes between the tables. In such scenarios, direct join can be 
done based on any combination of the shared attributes. Unless 
one of the attribute happens to be a key column the precision of 
the joins is low. In order to illustrate this, we considered the data 
model with more than one shared attribute and measured the pre- 
cision and recall for all the possible join paths between the tables. 
The experimental results (See Figure[TTJ show that SmartInt had 
higher F-measure than all possible join paths. 

Tradeoffs in number vs. completeness of the answers: 

Normal query processing systems are only concerned about re- 
trieving top-k results since the width(number of attributes) of the 
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Figure 13: Time taken by AFDMiner vs Length of AFD 
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Figure 14: Time taken by AFDMiner vs No. of tuples 

tuple is fixed. But SmartInt chains across the tables to increase 
the extent of completion of the entity. This poses an interesting 
tradeoff: In a given time, the system can retrieve more tuples with 
less width or fewer tuples with more width. In addition to this, 
if user is only interested high confidence answers, each tuple can 
expand to variable width to give out high precision result set. We 
analyze how precision and recall varies with w (the number of at- 
tributes to be shown). The Figure [12] shows how precision, recall 
and F-measure varies as more number of attributes are predicted 
for a specific result set (the query constraints are make='BMW' 
and year='2003'). In scenarios, when SmartInt has to deal with 
infinite width tuples, F-measure can be used to guide SmartInt 
when to stop expanding. 

Learning Time (AFDMiner): 

We invoke AFDMiner to learn the association rules and the AFDs. 
But this is done offline before query processing starts. So learning 
time usually does not directly affect the performance of the sys- 
tem. Nevertheless, the current implementation of AFDMiner uses 
several optimizations and data preprocessing to keep learning time 
low. In fact, AFDMiner takes only about 4 seconds for mining the 
rules used in the current experimental setup. Figure [14] and [13] 
show the comparison between the time taken for AFDMiner with 
specificity threshold set to 0.5 and 1, with varying tuplesize and 
the length of the AFD respectively!^ We see that specificity metric 
results in faster learning times. For a detailed experimentation on 
AFDMiner, refer to |18 |. 

7. CONCLUSION AND FUTURE WORK 

Our work is an attempt to provide better query support for web 
databases having tables with shared attributes using learned attribute 

^At first blush, pruning highly specific AFDs seems to hurt the 
precision, but in the current set of experiements specificity based 
pruning reduced the total running time and did not effect the accu- 
racy. 



dependencies but missing primary key - foreign key relationship. 
We use learned attribute dependencies to make up for the missing 
PK-FK information and recover entities spread over multiple ta- 
bles. Our experimental results demonstrate that approach used by 
SmartInt is able to strike a better balance between precision and 
recall than can be achieved by relying on single table or employing 
direct joins. 

We are currently exploring a variety of extensions to the SmartInt 
system. These include (i) differentiating the importance of the at- 
tributes in tuple expansion (ii) allowing variable width answers, and 
assessing the diminishing rewards of additional information using 
a discounted reward model and (iii) considering vertical fragmen- 
tation of tables in addition to horizontal fragmentation (which will 
involve operating with a set of base tables rather than a single one). 
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